TTS dataset
Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS
Susladkar, Onkar Kishor, Tripathi, Vishesh, Ahmed, Biddwan
This research introduces a comprehensive Bahasa text-to-speech (TTS) dataset and a novel TTS model, EnGen-TTS, designed to enhance the quality and versatility of synthetic speech in the Bahasa language. The dataset, spanning ~55.0 hours and 52K audio recordings, integrates diverse textual sources, ensuring linguistic richness. A meticulous recording setup captures the nuances of Bahasa phonetics, employing professional equipment to ensure high-fidelity audio samples. Statistical analysis reveals the dataset's scale and diversity, laying the foundation for model training and evaluation. The proposed EnGen-TTS model outperforms established baselines, achieving a Mean Opinion Score (MOS) of 4.45 ± 0.13. Additionally, our investigation of real-time factor and model size highlights EnGen-TTS as a compelling, efficient choice. This research marks a significant advancement in Bahasa TTS technology, with implications for diverse language applications. Link to generated samples: https://bahasa-harmony-comp.vercel.app/
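The reported MOS of 4.45 ± 0.13 is a mean listener rating with an interval around it. As a minimal sketch of how such a figure is typically aggregated (assuming the ± value is a 95% confidence interval over individual ratings, which the abstract does not state), the computation could look like:

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean Opinion Score with a t-distribution confidence interval.

    ratings: 1-D array of listener scores on the usual 1-5 MOS scale.
    """
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    sem = stats.sem(ratings)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mean, half_width

# Hypothetical ratings; real studies pool many listeners and utterances.
scores = np.random.default_rng(0).integers(3, 6, size=200)
mos, ci = mos_with_ci(scores)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```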
IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
Sankar, Ashwin, Anand, Srija, Varadhan, Praveen Srinivasa, Thomas, Sherry, Singal, Mehak, Kumar, Shridhar, Mehendale, Deovrat, Krishana, Aditi, Raju, Giri, Khapra, Mitesh
Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations collected in low-quality environments to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models trained on English and applied to Indian languages. This results in IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data, which contains a diverse set of speakers across language families. We open-source all data and code, releasing the first TTS model for all 22 official Indian languages.
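The core of the IV-R pipeline is applying English-trained denoising and speech-enhancement models zero-shot to noisy Indian-language ASR audio. A minimal sketch of that idea, using torchaudio for I/O with a stand-in enhance() where the actual enhancement model would go (the sample rate and mono downmix below are assumptions, not the paper's exact recipe):

```python
import torch
import torchaudio

TARGET_SR = 16_000  # assumed working sample rate

def enhance(waveform: torch.Tensor, sr: int) -> torch.Tensor:
    """Stand-in for an English-trained denoising/speech-enhancement model
    applied zero-shot to Indian-language audio; identity for illustration."""
    return waveform

def clean_asr_clip(in_path: str, out_path: str) -> None:
    waveform, sr = torchaudio.load(in_path)        # noisy conversational ASR clip
    waveform = waveform.mean(dim=0, keepdim=True)  # collapse to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    torchaudio.save(out_path, enhance(waveform, TARGET_SR), TARGET_SR)
```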
Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings
Varadhan, Praveen Srinivasa, Sankar, Ashwin, Raju, Giri, Khapra, Mitesh M.
We release Rasa, the first multilingual expressive TTS dataset for any Indian language, which contains 10 hours of neutral speech and 1-3 hours of expressive speech for each of the 6 Ekman emotions covering 3 languages: Assamese, Bengali, & Tamil. Our ablation studies reveal that just 1 hour of neutral and 30 minutes of expressive data can yield a Fair system as indicated by MUSHRA scores. Increasing neutral data to 10 hours, with minimal expressive data, significantly enhances expressiveness. This offers a practical recipe for resource-constrained languages, prioritizing easily obtainable neutral data alongside smaller amounts of expressive data. We show the importance of syllabically balanced data and pooling emotions to enhance expressiveness. We also highlight challenges in generating specific emotions, e.g., fear and surprise.
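The "Fair" label refers to a band on the MUSHRA scale (ITU-R BS.1534), where mean scores are commonly read against fixed 20-point quality bands. A small sketch of that mapping (the band boundaries are the standard divisions; the paper's exact listening protocol is not restated here):

```python
def mushra_band(mean_score: float) -> str:
    """Map a mean MUSHRA score (0-100 scale, ITU-R BS.1534) to its quality band."""
    for lower, label in [(80, "Excellent"), (60, "Good"), (40, "Fair"), (20, "Poor")]:
        if mean_score >= lower:
            return label
    return "Bad"

assert mushra_band(55.0) == "Fair"  # any mean in [40, 60) reads as "Fair"
```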
Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model
Lehečka, Jan, Hanzlíček, Zdeněk, Matoušek, Jindřich, Tihelka, Daniel
In this paper, we experiment with the SpeechT5 model pre-trained on large-scale datasets. We pre-trained the foundation model from scratch and fine-tuned it on a large-scale, robust multi-speaker text-to-speech (TTS) task. We tested the model's capabilities in zero- and few-shot scenarios. In two listening tests, we evaluated the quality of the synthetic audio and how closely the synthetic voices resemble the real ones. Our results show that the SpeechT5 model can generate a synthetic voice for any speaker using only one minute of the target speaker's data. We demonstrate the high quality and similarity of our synthetic voices on publicly known Czech politicians and celebrities.
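Zero-shot voice cloning with SpeechT5 conditions generation on a speaker embedding (an x-vector) rather than a speaker ID. A sketch using the Hugging Face SpeechT5 implementation (this uses the public English checkpoints and a precomputed x-vector purely for illustration; the paper's Czech model is trained from scratch and is not assumed here):

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Public English checkpoints, used only to illustrate the interface.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# A precomputed 512-dim x-vector stands in for the target speaker; for true
# zero-shot cloning you would extract one from ~1 minute of reference audio.
xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(xvectors[7306]["xvector"]).unsqueeze(0)

inputs = processor(text="Hello from a borrowed voice.", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("cloned.wav", speech.numpy(), samplerate=16_000)
```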
MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset
Liang, Kailin, Liu, Bin, Hu, Yifan, Liu, Rui, Bao, Feilong, Gao, Guanglai
Text-to-Speech (TTS) synthesis for low-resource languages is an attractive research issue in academia and industry nowadays. Mongolian is the official language of the Inner Mongolia Autonomous Region and a representative low-resource language spoken by over 10 million people worldwide. However, there is a relative lack of open-source datasets for Mongolian TTS. Therefore, we make public an open-source multi-speaker Mongolian TTS dataset, named MnTTS2, for the benefit of related researchers. In this work, we prepare transcriptions covering various topics and invite three professional Mongolian announcers to form a three-speaker TTS dataset, in which each announcer records 10 hours of speech in Mongolian, resulting in 30 hours in total. Furthermore, we build a baseline system based on the state-of-the-art FastSpeech2 model and HiFi-GAN vocoder. The experimental results suggest that the constructed MnTTS2 dataset is sufficient to build robust multi-speaker TTS models for real-world applications. The MnTTS2 dataset, training recipe, and pretrained models are released at: https://github.
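The FastSpeech2 + HiFi-GAN baseline is the common two-stage recipe: an acoustic model predicts a mel-spectrogram from text, and a neural vocoder turns the mel-spectrogram into a waveform. A minimal sketch of that pipeline (the FastSpeech2 and HiFiGAN classes below are hypothetical stubs, not a specific library's API; the mel bin count, upsampling factor, and hop size are assumptions):

```python
import torch

class FastSpeech2(torch.nn.Module):
    """Hypothetical acoustic model: phoneme IDs -> 80-bin mel-spectrogram."""
    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # A real model predicts durations, pitch, and energy, then mel frames;
        # zeros stand in so the sketch runs end to end.
        return torch.zeros(phoneme_ids.shape[0], 80, phoneme_ids.shape[1] * 4)

class HiFiGAN(torch.nn.Module):
    """Hypothetical vocoder: mel-spectrogram -> waveform (hop size 256 assumed)."""
    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return torch.zeros(mel.shape[0], mel.shape[-1] * 256)

@torch.no_grad()
def synthesize(text: str, acoustic: FastSpeech2, vocoder: HiFiGAN) -> torch.Tensor:
    ids = torch.tensor([[ord(c) % 100 for c in text]])  # toy text front-end stand-in
    mel = acoustic(ids)   # stage 1: text -> mel-spectrogram
    return vocoder(mel)   # stage 2: mel -> waveform

audio = synthesize("Sain baina uu", FastSpeech2(), HiFiGAN())
```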
Contextual Expressive Text-to-Speech
Tu, Jianhong, Cui, Zeyu, Zhou, Xiaohuan, Zheng, Siqi, Hu, Kai, Fan, Ju, Zhou, Chang
The goal of expressive text-to-speech (TTS) is to synthesize natural speech with the desired content, prosody, emotion, or timbre, with high expressiveness. Most previous studies attempt to generate speech from given labels of styles and emotions, which over-simplifies the problem by classifying styles and emotions into a fixed number of pre-defined categories. In this paper, we introduce a new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. Thus, in the CTTS task, we propose to utilize such context to guide the speech synthesis process instead of relying on explicit labels of styles and emotions. To achieve this, we construct a synthetic dataset and develop an effective framework. Experiments show that our framework can generate high-quality expressive speech based on the given context, both on synthetic datasets and in real-world scenarios.
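CTTS replaces categorical style labels with an embedding of the surrounding context text. A minimal sketch of that conditioning idea, using a sentence-transformer as the context encoder and a hypothetical style-conditioned TTS model (the class, dimensions, and checkpoint below are illustrative choices, not the paper's architecture):

```python
import torch
from sentence_transformers import SentenceTransformer

class StyleConditionedTTS(torch.nn.Module):
    """Hypothetical TTS that consumes a style vector instead of a style label."""
    def __init__(self, style_dim: int = 384):
        super().__init__()
        self.style_proj = torch.nn.Linear(style_dim, 256)  # context -> style space

    def forward(self, text: str, style: torch.Tensor) -> torch.Tensor:
        style = self.style_proj(style)  # conditioning vector for the decoder
        # Stand-in: a real model would decode mel frames conditioned on `style`.
        return torch.zeros(1, 16_000)

# Encode the dialogue context as the style signal (instead of "angry"/"happy" tags).
encoder = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dim sentence embeddings
context = "She slammed the door and shouted into the empty hallway."
style_vec = torch.tensor(encoder.encode([context]))  # shape (1, 384)

tts = StyleConditionedTTS()
waveform = tts("What do you want from me?", style_vec)
```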